A New Probabilistic Model of Text Classification and Retrieval

Author

  • Tom Kalt
Abstract

This paper introduces the multinomial model of text classification and retrieval. One important feature of the model is that the tf statistic, which usually appears in probabilistic IR models as a heuristic, is an integral part of the model. Another is that the variable length of documents is accounted for, without either making a uniform length assumption or using length normalization. The multinomial model employs independence assumptions which are similar to assumptions made in previous probabilistic models, particularly the binary independence model and the 2-Poisson model. The use of simulation to study the model is described. Performance of the model is evaluated on the TREC-3 routing task. Results are compared with the binary independence model and with the simulation studies.
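As a rough illustration of how term frequency can enter a multinomial model directly, the sketch below scores a document by its log-odds under two term distributions, one for relevant and one for non-relevant text. It is a minimal sketch and not the paper's own estimator: the function names `estimate_term_probs` and `log_odds`, the additive smoothing constant `alpha`, and the toy data are all assumptions made for illustration. Because the multinomial coefficient depends only on the document's counts, it cancels in the ratio, and each term contributes its tf times a log probability ratio, so documents of any length can be scored without a separate normalization step.

```python
import math
from collections import Counter


def estimate_term_probs(docs, vocab, alpha=1.0):
    """Estimate per-term probabilities from tokenized docs.

    Assumption: additive (Laplace-style) smoothing with constant alpha;
    the abstract does not specify the estimator.
    """
    counts = Counter()
    for doc in docs:
        counts.update(t for t in doc if t in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}


def log_odds(doc, p_rel, p_non):
    """Log-likelihood ratio of a document under two multinomials.

    The multinomial coefficient is identical in numerator and
    denominator, so only sum(tf * log(p_rel / p_non)) remains.
    Terms outside the vocabulary are ignored.
    """
    tf = Counter(t for t in doc if t in p_rel)
    return sum(n * math.log(p_rel[t] / p_non[t]) for t, n in tf.items())


# Hypothetical toy data, for illustration only.
vocab = {"cat", "dog", "fish"}
p_rel = estimate_term_probs([["cat", "cat", "dog"]], vocab)
p_non = estimate_term_probs([["dog", "fish", "fish"]], vocab)
score = log_odds(["cat", "cat"], p_rel, p_non)
```

A positive score favors the relevant class and a negative score favors the non-relevant class; in a routing setting, documents would be ranked by this score.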

Similar resources

A Term Association Translation Model for Naive Bayes Text Classification

Text classification (TC) has long been an important research topic in information retrieval (IR) related areas. In the literature, the bag-of-words (BoW) model has been widely used to represent a document in text classification and many other applications. However, BoW, which ignores the relationships between terms, offers a rather poor document representation. Some previous research has shown tha...

A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization

The Rocchio relevance feedback algorithm is one of the most popular and widely applied learning methods from information retrieval. Here, a probabilistic analysis of this algorithm is presented in a text categorization framework. The analysis gives theoretical insight into the heuristics used in the Rocchio algorithm, particularly the word weighting scheme and the similarity metric. It also sug...

Independent component analysis for understanding multimedia content

This paper focuses on using independent component analysis of combined text and image data from web pages. This has potential for search and retrieval applications in order to retrieve more meaningful and context dependent content. It is demonstrated that using ICA on combined text and image features provides a synergistic effect, i.e., the retrieval classification rates increase if based on mult...

A Common Lisp Framework for Document Classification and Retrieval

This paper describes the Document Classification Substrate (DCS) and accompanying protocols. The DCS is a framework of Lisp support code facilitating the prototyping and deployment of systems for automatic document classification and retrieval applications. The DCS design reflects the following observations concerning the problem of classification of texts. 1. Initial preprocessing (lexical feature...

Publication date: 1996